By now we have learned all the skills used in data preprocessing.
Let's review them.
The first three items are mostly basic Python syntax.
Item 4 covers how to handle missing data: the gaps can be filled with a statistic such as the mean, the median, or the most frequent value.
Item 5 covers categorical data. Some fields are strings rather than numbers, so a Python library is used to convert them into integers or into one-hot binary vectors (e.g. 001, 010, 100); only after that conversion can they be fed into a model.
Item 6 covers the training and test sets. The training set lets the machine use a model to learn the relationships in the data, while the test set acts as an exam that checks how accurate the trained model is.
Item 7, feature scaling, keeps the features' weights from differing too much. It is a bit like Chinese counting for 5 while English counts for only 1, which is too big a gap; a Python library can rescale the features so each weight is 1 or close to it, giving every feature a comparable influence on the computation.
Item 8 is what this page does: lay out all the code from the seven items above as a template that later lessons can follow.
# -*- coding: utf-8 -*-
"""
Editor: Ironcat45
Title: For data processing practice
Date: 2022/09/21
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
"""
1. Get and import the dataset
Use read_csv() and iloc[] to load the dataset
Read columns 0-2 of every row as array x
Read column 3 of every row as array y
"""
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
"""
2. Deal with the missing data
Use SimpleImputer to impute the missing data
by calling fit() and transform()
"""
# from sklearn.preprocessing import Imputer  # old API, removed in newer scikit-learn
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# only impute columns 1 and 2 (the slice 1:3 excludes 3)
imputer = imputer.fit(x[:,1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
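# (A sketch I added, not part of the original template.) SimpleImputer also
# supports the 'median', 'most_frequent' and 'constant' strategies, which can be
# swapped in the same way, e.g.:
# imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
# x[:, 1:3] = imputer_median.fit_transform(x[:, 1:3])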
"""
3.1 Deal with the categorical data
Use LabelEncoder to convert the data
(from string labels to integers)
"""
# convert string labels to integers
from sklearn.preprocessing import LabelEncoder
lblenc_x = LabelEncoder()
# fit the label encoder and transform the values
# in column 0 (all rows)
x[:,0] = lblenc_x.fit_transform(x[:,0])
lblenc_y = LabelEncoder()
y = lblenc_y.fit_transform(y)
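# Note (my addition): LabelEncoder assigns arbitrary integers to the categories
# (e.g. 0, 1, 2 if column 0 holds three country names), which implies an ordering
# the data does not really have -- that is why step 3.2 one-hot encodes column 0
# instead, while plain integer encoding is enough for the class label y.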
"""
3.2. Deal with the categorical data <dummy encoding>
Use the OneHotEncoder and ColumnTransformer to encode
a column of data
"""
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)
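# (Illustration, not in the original template.) After fit_transform, column 0 is
# replaced by one dummy column per category, placed in front of the passthrough
# columns; printing a few rows shows the new layout:
# print(x[:3])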
"""
4. Split the dataset
"""
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
"""
5. Feature scaling
"""
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
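# Quick sanity check (my addition, assuming the template above has just run):
# after StandardScaler every column of x_train should have mean ~0 and std ~1.
print(x_train.mean(axis=0))
print(x_train.std(axis=0))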
In the final template we keep only steps 1 and 4,
since those are likely all that the later lessons will need.
Whether steps 2, 3, and 5 are applied still depends on the actual data,
but the teacher did not explain this part in much detail,
so I will look up more material and supplement it later.
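A quick way to check which of those steps a new dataset actually needs (a sketch of my own, reusing the dataset frame loaded in step 1):

# count missing values per column; non-zero counts mean step 2 (imputation) is needed
print(dataset.isnull().sum())
# 'object' dtypes are string columns, so step 3 (encoding) is needed for them
print(dataset.dtypes)
# very different numeric ranges suggest step 5 (scaling) will help
print(dataset.describe())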
# -*- coding: utf-8 -*-
"""
Editor: Ironcat45
Title: For data processing practice
Date: 2022/09/21
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
"""
1. Get and import the dataset
Use read_csv() and iloc[] to load the dataset
Read columns 0-2 of every row as array x
Read column 3 of every row as array y
"""
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
"""
4. Split the dataset
"""
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
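A small usage check (not part of the course template): with test_size=0.2, roughly 20% of the rows end up in the test set, which the shapes confirm.

# e.g. a 10-row Data.csv gives 8 training rows and 2 test rows
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)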